Overview

This project is intended for CAP6768 - Data Analytics - Discussion 6

For this discussion activity, perform the following tasks:

  1. From the listing of datasets, select a dataset that is conducive for regression analysis. You will have to scroll down a bit to see a listing of datasets that are relevant to regression analysis. Select a dataset that has 30 or more rows of records. Name and describe the selected dataset. State number of records and variables available in the dataset. Explain why the selected dataset is a good exemplar data for practicing regression analysis.

Note: Read other students’ posts before selecting your dataset for this discussion activity. If another student has chosen a dataset, then you will have to choose a different dataset.

  1. State the investigative question in relevance to the selected dataset and regression analysis. State the relationships between variables that you intend to investigate.

  2. Perform appropriate regression analysis to conduct the stated investigation. Explain the tooling used and processes used for performing regression analysis.

  3. Provide the regression analysis results and explain the findings in relevance to the investigative questions posed.

Dataset

For this discussion, I’ve choosen the dataset “Spread of Shotgun Pellets by Distance of Shot”. This dataset looks at the measured spread of shotgun pellets for different distances from a target.

We start by loading the downloaded datafile and renaming the columns to match the description.

We display to resulting dataset below.

shotgun_data <- read.table("shotgun_spread.dat", header = FALSE, skip = 3)
colnames(shotgun_data) <- c("cartridge", "range", "sqrt_spread", "std_of_sqrt_spread_for_range")

shotgun_data <- shotgun_data %>%
  mutate(
    cartridge = case_when(
      cartridge == 1 ~ "Winchester Western Super X 00",
      cartridge == 2 ~ "Remington No. 4"
    )
  )

shotgun_data %>% datatable()

Analysis

In order to better understand the dataset, we start by plotting the data onto a scatterplot.

shotgun_data %>%
  plot_ly(x = ~range, y = ~sqrt_spread, color = ~cartridge) %>%
  layout(
    title = "Square Root of the Spead verses Range",
    xaxis = list(title = "Range"),
    yaxis = list(title = "Sqrt. Spread")
  )
## No trace type specified:
##   Based on info supplied, a 'scatter' trace seems appropriate.
##   Read more about this trace type -> https://plot.ly/r/reference/#scatter
## No scatter mode specifed:
##   Setting the mode to markers
##   Read more about this attribute -> https://plot.ly/r/reference/#scatter-mode
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels

## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels

As we can see, there is a clear positive correlation between the range and the square root of the spread. We also notice that the increase in range correlates withe a wider variance of values for the sqrt of the spread. We can show this by looking at the std deviation of the sqrt of the spread for each range. Luckily, this is already provided in the dataset.

shotgun_data %>%
  plot_ly(x = ~range, y = ~std_of_sqrt_spread_for_range, color = ~cartridge) %>%
  layout(
    title = "Std. of Sqrt of Spread verses Range",
    xaxis = list(title = "Range"),
    yaxis = list(title = "Std. of Sqrt of Spread")
  )
## No trace type specified:
##   Based on info supplied, a 'scatter' trace seems appropriate.
##   Read more about this trace type -> https://plot.ly/r/reference/#scatter
## No scatter mode specifed:
##   Setting the mode to markers
##   Read more about this attribute -> https://plot.ly/r/reference/#scatter-mode
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels

## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels

As we can see, there is an exponential relation between the Range and the square root of the spread.

Regression

We will use regression to measure the correlation between these variables.

model_1 <- shotgun_data %>% 
  lm(sqrt_spread ~ range, data = .)

shotgun_data %>%
  plot_ly(x = ~range) %>%
  add_markers(y = ~sqrt_spread, color = ~cartridge) %>%
  add_lines(x = ~range, y = fitted(model_1), name = "Regression") %>%
  layout(
    title = "Square Root of the Spread verses Range",
    xaxis = list(title = "Range"),
    yaxis = list(title = "Sqrt. Spread")
  )
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels

## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
summary(model_1)
## 
## Call:
## lm(formula = sqrt_spread ~ range, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8804 -0.6920 -0.1348  0.5280  5.4496 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.13357    0.50110  -0.267    0.791    
## range        0.29628    0.01476  20.073   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.529 on 55 degrees of freedom
## Multiple R-squared:  0.8799, Adjusted R-squared:  0.8777 
## F-statistic: 402.9 on 1 and 55 DF,  p-value: < 2.2e-16

We are also able to fit an exponential regression to each of the spread outputs. We will create two different models for each of the cartridge types.

cartridge1_data <- shotgun_data %>% filter(cartridge == 'Winchester Western Super X 00')
cartridge2_data <- shotgun_data %>% filter(cartridge == 'Remington No. 4')

model_2 <- cartridge1_data %>% 
  lm(std_of_sqrt_spread_for_range ~ range, data = .)


model_3 <- cartridge2_data %>% 
  lm(std_of_sqrt_spread_for_range ~ range, data = .)

Cartridge 1, Winchester Western Super X 00 Regression Model

cartridge1_data %>%
  plot_ly(x = ~range) %>%
  add_markers(y = ~std_of_sqrt_spread_for_range, color = ~cartridge) %>%
  add_lines(x = ~range, y = fitted(model_2), name = "Regression") %>%
  layout(
    title = "Std. of the Spread verses Range",
    xaxis = list(title = "Range"),
    yaxis = list(title = "Std. of the Spread")
  )
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels

## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
summary(model_2)
## 
## Call:
## lm(formula = std_of_sqrt_spread_for_range ~ range, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.14863 -0.10080 -0.05646  0.16403  0.28371 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.663886   0.079937  -8.305 1.18e-08 ***
## range        0.049917   0.002297  21.732  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1569 on 25 degrees of freedom
## Multiple R-squared:  0.9497, Adjusted R-squared:  0.9477 
## F-statistic: 472.3 on 1 and 25 DF,  p-value: < 2.2e-16
cartridge2_data %>%
  plot_ly(x = ~range) %>%
  add_markers(y = ~std_of_sqrt_spread_for_range, color = ~cartridge) %>%
  add_lines(x = ~range, y = fitted(model_3), name = "Regression") %>%
  layout(
    title = "Std. of the Spread verses Range",
    xaxis = list(title = "Range"),
    yaxis = list(title = "Std. of the Spread")
  )
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels

## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
summary(model_3)
## 
## Call:
## lm(formula = std_of_sqrt_spread_for_range ~ range, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.3450 -0.0762  0.0426  0.0924  0.2862 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.544400   0.092351  -5.895 2.43e-06 ***
## range        0.067120   0.002784  24.105  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2157 on 28 degrees of freedom
## Multiple R-squared:  0.954,  Adjusted R-squared:  0.9524 
## F-statistic:   581 on 1 and 28 DF,  p-value: < 2.2e-16